Target Tracking Method and Device, and Electronic Apparatus

ABSTRACT

The present disclosure provides a target tracking method, a target tracking device and an electronic apparatus. The target tracking method includes: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap; and determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.

CROSS-REFERENCE TO RELATED APPLICATION

The present disclosure claims priority to Chinese Patent Application No. 202110231514.8 filed in China on Mar. 2, 2021, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to the field of Artificial Intelligence (AI) in the computer technology, in particular to a computer vision technology and a deep learning technology, and more particularly to a target tracking method, a target tracking device, and an electronic apparatus.

BACKGROUND

Target detection and tracking is a basis for various computer vision tasks. Targets such as human beings and vehicles are detected and tracked for tasks such as pedestrian analysis, smart traffic and unmanned driving.

Currently, three algorithms/models are usually adopted in a target tracking procedure, i.e., a target detection algorithm/model, a feature extraction algorithm/model, and a multi-target tracking algorithm/model. These three algorithms/models are run in series. A target in an image is detected through the target detection model to obtain a detection box, then a feature in the detection box of the image is extracted through the feature extraction model, and then the target is tracked through the multi-target tracking algorithm in accordance with the feature.

SUMMARY

An object of the present disclosure is to provide a target tracking method, a target tracking device and an electronic apparatus.

In a first aspect, the present disclosure provides in some embodiments a target tracking method, including: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap; and determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.

According to the target tracking method in the embodiments of the present disclosure, the two adjacent images, i.e., the i^(th) image and the (i−1)^(th) image, are inputted into the target deep learning model, and the target is detected and tracked through the target deep learning model using a predetermined anchor box, so as to obtain a target detection result, i.e., the first target detection box, and the tracking heatmap. Next, the target tracking result is determined in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image. As compared with a scheme where a target detection model, a feature extraction model and a multi-target tracking model are run in series to track the target, in the embodiments of the present disclosure the first target detection box and the tracking heatmap are obtained through a single target deep learning model, and the target tracking result is determined in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image, thereby improving the target tracking efficiency.

In a second aspect, the present disclosure provides in some embodiments a target tracking device, including: an input module configured to input an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; a detection and tracking module configured to detect a target in the i^(th) image to obtain a first target detection box, and track the target in the (i−1)^(th) image to obtain a tracking heatmap; and a determination module configured to determine a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.

In a third aspect, the present disclosure provides in some embodiments an electronic apparatus, including at least one processor and a memory in communication with the at least one processor. The memory is configured to store therein an instruction executed by the at least one processor, and the at least one processor is configured to execute the instruction so as to implement the above-mentioned target tracking method.

In a fourth aspect, the present disclosure provides in some embodiments a non-transient computer-readable storage medium storing therein a computer instruction. The computer instruction is executed by a computer so as to implement the above-mentioned target tracking method.

In a fifth aspect, the present disclosure provides in some embodiments a computer program product comprising a computer program. The computer program is executed by a processor so as to implement the above-mentioned target tracking method.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings are provided to facilitate the understanding of the present disclosure, but shall not be construed as limiting the present disclosure. In these drawings,

FIG. 1 is a flow chart of a target tracking method according to an embodiment of the present disclosure;

FIG. 2 is another flow chart of the target tracking method according to an embodiment of the present disclosure;

FIG. 3 is a schematic view showing a principle of a target deep learning model for the target tracking method according to an embodiment of the present disclosure;

FIG. 4 is a schematic view showing a correspondence between a pixel position in an output from a target detection branch and a tracking heatmap in the target tracking method according to an embodiment of the present disclosure;

FIG. 5 is a structural view of a target tracking device according to an embodiment of the present disclosure; and

FIG. 6 is a block diagram of an electronic apparatus for implementing the target tracking method according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

In the following description, numerous details of the embodiments of the present disclosure, which should be deemed merely as exemplary, are set forth with reference to accompanying drawings to provide an understanding of the embodiments of the present disclosure. Therefore, those skilled in the art will appreciate that modifications or replacements may be made in the described embodiments without departing from the scope and spirit of the present disclosure. Further, for clarity and conciseness, descriptions of known functions and structures are omitted.

As shown in FIG. 1, the present disclosure provides in some embodiments a target tracking method, which includes the following steps.

Step S101: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1. In other words, two adjacent images in the to-be-detected video stream are inputted into the target deep learning model for the subsequent target detection and tracking.

Step S102: detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap. In other words, through the target deep learning model, the target in the i^(th) image is detected so as to obtain the first target detection box, and the target in the (i−1)^(th) image is tracked so as to obtain the tracking heatmap.

As an example, anchor boxes with different sizes and scales are preset in the target deep learning model, and the target is detected and tracked through the target deep learning model using the preset anchor boxes, so as to obtain the first target detection box and the tracking heatmap. It should be appreciated that each preset anchor box is a normalized anchor box. In a procedure of detecting the target in the i^(th) image to obtain the first target detection box, the target is first detected to obtain a target anchor box and coordinates of a central pixel, i.e., to obtain an initial target detection box. The initial target detection box includes the target anchor box and the coordinates of the central pixel. Next, through transformation (transformation of an area and of the coordinates of the central pixel), the first target detection box corresponding to the i^(th) image is obtained.

Step S103: determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image. The target tracking result is obtained in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image, so as to track the target in the two adjacent images.

According to the target tracking method in the embodiments of the present disclosure, the two adjacent images, i.e., the i^(th) image and the (i−1)^(th) image, are inputted into the target deep learning model, and the target is detected and tracked through the target deep learning model using a predetermined anchor box, so as to obtain a target detection result, i.e., the first target detection box, and the tracking heatmap. Next, the target tracking result is determined in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image. As compared with a scheme where a target detection model, a feature extraction model and a multi-target tracking model are run in series to track the target, in the embodiments of the present disclosure the first target detection box and the tracking heatmap are obtained through a single target deep learning model, and the target tracking result is determined in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image, thereby improving the target tracking efficiency.

As shown in FIG. 2, in a possible embodiment of the present disclosure, Step S103 of determining the target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image includes: Step S1031 of determining a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and Step S1032 of determining coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap. The target tracking result includes the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.

During the detection, after the first target detection box, i.e., the target detection result, has been obtained through the preset anchor box, the index of the anchor box corresponding to the first target detection box is determined. Then, the target tracking heatmap is determined in the tracking heatmap in accordance with the index of the anchor box and the coordinates of the center of the first target detection box, i.e., the target tracking heatmap corresponding to the index of the anchor box and the coordinates of the center of the first target detection box is determined in the tracking heatmap. The coordinates of the center of the target in the first target detection box are determined on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, so as to obtain the target tracking result. The target tracking result includes the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image as well as the coordinates of the center of the first target detection box. In other words, a tracking result of the i^(th) image and the (i−1)^(th) image includes a pair of coordinates, i.e., the coordinates of the center of the target in the first target detection box on the i^(th) image and the coordinates of the center of the target on the (i−1)^(th) image.

In the embodiments of the present disclosure, the target tracking heatmap is determined in the tracking heatmap in accordance with the index of the anchor box corresponding to the first target detection box as well as the coordinates of the center of the first target detection box, and then the coordinates of the center of the target in the first target detection box are determined on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, so as to obtain the target tracking result. As compared with a mode where a feature is extracted and then the target is tracked through the multi-target tracking algorithm, the tracking mode in the embodiments of the present disclosure improves the tracking efficiency.

In a possible embodiment of the present disclosure, subsequent to determining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, the target tracking method further includes, in the case that the coordinates of the center match first coordinates, determining that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are detection boxes for a same target. The first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.

In other words, prior to inputting the i^(th) image and the (i−1)^(th) image into the target deep learning model, the (i−1)^(th) image and an (i−2)^(th) image are inputted into the target deep learning model. The target in the (i−1)^(th) image is detected to obtain the second target detection box, and the target in the (i−2)^(th) image is tracked to obtain a tracking heatmap. A procedure of determining a target tracking result in accordance with the second target detection box, the tracking heatmap obtained through tracking the target in the (i−2)^(th) image, and the (i−2)^(th) image is similar to the above-mentioned procedure of determining the target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image, but with different results due to the different inputted images. In other words, the target in the (i−1)^(th) image is detected in advance to obtain the second target detection box, and after the coordinates of the center of the target in the first target detection box have been determined on the (i−1)^(th) image, the coordinates of the center are compared with the coordinates of the center of the second target detection box. In the case that the coordinates of the center match the first coordinates, the first target detection box in the i^(th) image and the second target detection box in the (i−1)^(th) image are determined to be detection boxes for the same target; otherwise they are detection boxes for different targets.

In the embodiments of the present disclosure, after obtaining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image, it is necessary to determine whether the coordinates of the center match the first coordinates. In the case that the coordinates of the center match the first coordinates, the first target detection box in the i^(th) image and the second target detection box in the (i−1)^(th) image are determined to be detection boxes for the same target. In this way, whether the first target detection box in the i^(th) image and the second target detection box in the (i−1)^(th) image belong to the same target can be determined, so that the target is tracked accurately.

In a possible embodiment of the present disclosure, the target deep learning model includes a neural network, a feature pyramid network, a target detection branch, and a target tracking branch. The detecting the target in the i^(th) image to obtain the first target detection box and tracking the target in the (i−1)^(th) image to obtain the tracking heatmap includes: processing the i^(th) image and the (i−1)^(th) image through the neural network, so as to output a plurality of first feature maps; processing the plurality of first feature maps through the feature pyramid network, so as to output a plurality of second feature maps; detecting the target through the target detection branch in accordance with the plurality of second feature maps and the preset anchor box, so as to determine the first target detection box in the i^(th) image; and tracking the target through the target tracking branch in accordance with the plurality of second feature maps, so as to obtain multiple classes of tracking heatmaps.

In other words, two adjacent images, i.e., the i^(th) image and the (i−1)^(th) image, are inputted into the target deep learning model, the first feature maps with different dimensions are obtained through the neural network (e.g., DarkNet or ResNet), and the first feature maps are inputted into the feature pyramid network to obtain the second feature maps with different dimensions. The second feature maps with different dimensions are used to sense targets with different dimensions in the i^(th) image in a descending order of the dimensions.

The plurality of second feature maps is inputted into the target detection branch (including a plurality of first convolutional layers connected in series). A plurality of initial detection results, each of whose channels has the same size as the respective second feature map, is obtained in accordance with the plurality of second feature maps and the preset anchor box. A detection box with a maximum detection probability (a probability that the target belongs to a certain class) is taken as an initial target detection box. A scale of the initial target detection box is transformed in accordance with a size of the i^(th) image and a size of the second feature map corresponding to the initial target detection box, so as to obtain the first target detection box.

The plurality of second feature maps is inputted into the target tracking branch (including a plurality of second convolutional layers connected in series), so as to process the plurality of second feature maps to obtain the multi-class tracking heatmap. A target class tracking heatmap is determined in the multi-class tracking heatmap in accordance with the index of the anchor box corresponding to the first target detection box as well as the coordinates of the center of the first target detection box, and the target tracking heatmap is determined in the target class tracking heatmap. A size of the target class tracking heatmap is the same as a size of the second feature map corresponding to the first target detection box.

In the embodiments of the present disclosure, the i^(th) image and the (i−1)^(th) image are processed through the neural network in the target deep learning model, so as to output the plurality of first feature maps. The plurality of first feature maps is processed through the feature pyramid network, so as to output the plurality of second feature maps. The target is detected through the target detection branch in accordance with the plurality of second feature maps and the preset anchor box, so as to determine the first target detection box in the i^(th) image. The target is tracked through the target tracking branch in accordance with the plurality of second feature maps, so as to obtain the multiple classes of tracking heatmaps. In this way, the accuracy of the tracking heatmap is improved.

In a possible embodiment of the present disclosure, each class of heatmap includes W*H*A channels, where A represents the quantity of preset anchor boxes and is an integer greater than 1, and W*H represents a size of the second feature map corresponding to each class of heatmap. A channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box.

Each tracking heatmap corresponds to one anchor box and coordinates of one center. A target channel index is determined in accordance with the index of the anchor box of the first target detection box and the coordinates of the center of the first target detection box, so as to determine a corresponding target tracking heatmap. For example, a channel index c of the target tracking heatmap is determined through the following equation: c = a*W*H + j*s1*H + i*s2, where a represents the index of the anchor box, (i, j) represents the coordinates of the center of the first target detection box, s1 = heatmap_width/image_width, s2 = heatmap_height/image_height, and * represents a multiplication sign.
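The channel-index equation above may be illustrated by the following minimal sketch. It is not taken from the disclosure: the function name tracking_channel_index is hypothetical, the anchor index and grid positions are assumed to be zero-based, and the scaled coordinates are assumed to be rounded down to integer grid positions, a rounding rule the text does not specify.

```python
def tracking_channel_index(a, center, heatmap_size, image_size):
    """Compute c = a*W*H + j*s1*H + i*s2 for one detection box.

    a            -- zero-based anchor index (assumption)
    center       -- (i, j) center of the first target detection box, in image pixels
    heatmap_size -- (W, H), i.e. (heatmap_width, heatmap_height)
    image_size   -- (image_width, image_height)
    """
    i, j = center
    W, H = heatmap_size
    image_width, image_height = image_size
    s1 = W / image_width    # s1 = heatmap_width / image_width
    s2 = H / image_height   # s2 = heatmap_height / image_height
    # Assumption: scaled coordinates are floored and clamped to the grid.
    scaled_j = min(int(j * s1), W - 1)
    scaled_i = min(int(i * s2), H - 1)
    return a * W * H + scaled_j * H + scaled_i


# Example: second anchor (a = 1) on a 13*13 heatmap of a 416*416 image.
print(tracking_channel_index(1, (208, 104), (13, 13), (416, 416)))
```

With three anchor boxes on a 13*13 heatmap, the indices produced in this way range from 0 to 13*13*3 − 1, which agrees with the W*H*A channels described above.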

In other words, in the embodiments of the present disclosure, the channel index is determined in accordance with the index of the anchor box and the coordinates of the center of the first target detection box, so as to determine the corresponding target tracking heatmap. In this way, the efficiency of determining the target tracking heatmap is improved.

As shown in FIG. 3, the above-mentioned method will be described hereinafter in more detail in conjunction with a specific embodiment.

An image is extracted from a real-time video stream obtained by a monitoring camera or the like. The image is pre-processed, i.e., scaled to a fixed size (e.g., 416*416), and the same RGB mean values (e.g., [104, 117, 123]) are subtracted therefrom. The image is pre-processed so as to obtain a uniform image size and improve the robustness of the model. In the embodiments of the present disclosure, it is necessary to pre-process two adjacent images from the video stream and input them into the model. For example, the two images are marked as Pi−1 and Pi respectively.
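The pre-processing described above may be sketched as follows. This is only an illustration under stated assumptions: the function name preprocess is hypothetical, an image is represented as a nested list of (R, G, B) tuples, and nearest-neighbour sampling is used for scaling because the disclosure does not name an interpolation method.

```python
def preprocess(image, size=416, mean=(104, 117, 123)):
    """Scale an image to size*size and subtract per-channel mean values.

    image -- list of rows, each row a list of (R, G, B) tuples
    """
    src_h = len(image)
    src_w = len(image[0])
    out = []
    for y in range(size):
        # Nearest-neighbour sampling (assumption, not stated in the text).
        src_y = min(y * src_h // size, src_h - 1)
        row = []
        for x in range(size):
            src_x = min(x * src_w // size, src_w - 1)
            pixel = image[src_y][src_x]
            row.append(tuple(float(c) - m for c, m in zip(pixel, mean)))
        out.append(row)
    return out


# Both adjacent frames would go through the same function before being
# inputted into the model; a tiny 2*2 frame is used here for brevity.
frame = [[(104, 117, 123), (255, 255, 255)],
         [(0, 0, 0), (104, 117, 123)]]
scaled = preprocess(frame, size=4)
```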

The two pre-processed images are inputted into, and calculated by, a target detection and tracking model (i.e., the target deep learning model). The basic framework of the model comes from the You Only Look Once: Unified, Real-Time Object Detection (YOLO) model.

As shown in FIG. 3, the two pre-processed images Pi−1 and Pi are inputted into the target deep learning model, and processed by a backbone network (i.e., the neural network, e.g., DarkNet or ResNet) to obtain the first feature maps with different dimensions. The first feature maps are inputted into the feature pyramid network to obtain three second feature maps y1, y2 and y3 with three different dimensions, i.e., 13*13*255, 26*26*255 and 52*52*255. The three second feature maps with different dimensions are used to sense targets with different dimensions in the image Pi in a descending order of the dimensions.

It is preset that A different kinds of preset anchor boxes with different scales are generated at each pixel position of each second feature map, and the target detection branch provides an output having a length of (5+N) for each anchor box, so as to indicate a prediction (conf, x, y, W, H, class) of the target detection box on the basis of the anchor box, where conf represents a confidence level of a target in the anchor box, x and y represent an abscissa and an ordinate of a normalized detection box respectively, W and H represent a size of the detection box, and class is a vector with a length of N. The probability that a target belongs to a certain class is the value of the element of the vector class at the index of that class. For example, when there are N classes, the vector class has a length of N. When a target belongs to a certain class, the element at the corresponding position in the vector has a value of 1, and the other N−1 elements each have a value of 0. For example, when a target belongs to a second class, the second element in the vector class has a value of 1, and the other elements each have a value of 0. The second feature map y1 is inputted into the target detection branch, so as to obtain a detection prediction result z1 having (5+N)*A channels and a width and a height of 13 each. Similarly, the second feature map y2 is inputted into the target detection branch, so as to obtain a detection prediction result z2 having (5+N)*A channels and a width and a height of 26 each. The second feature map y3 is inputted into the target detection branch, so as to obtain a detection prediction result z3 having (5+N)*A channels and a width and a height of 52 each. In other words, z1, z2 and z3 are results obtained after detecting the target in the image Pi. The first target detection box in the image Pi, e.g., a detection box corresponding to a maximum probability of belonging to a certain class, is determined in accordance with z1, z2 and z3.

The second feature map y1 is inputted into the target tracking branch, so as to obtain a tracking prediction result o1 having 13*13*A channels and a width and a height of 13 each. The second feature map y2 is inputted into the target tracking branch, so as to obtain a tracking prediction result o2 having 26*26*A channels and a width and a height of 26 each. The second feature map y3 is inputted into the target tracking branch, so as to obtain a tracking prediction result o3 having 52*52*A channels and a width and a height of 52 each.
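The output sizes of the two branches may be tabulated with the short sketch below. It assumes that each branch keeps the width and height of the second feature map it receives. The function name branch_output_shapes is hypothetical, and the default values A = 3 and N = 80 are illustrative only; the disclosure does not fix them.

```python
def branch_output_shapes(grid_sizes=(13, 26, 52), num_anchors=3, num_classes=80):
    """Return (width, height, channels) of each detection and tracking output."""
    shapes = []
    for s in grid_sizes:
        # Detection branch: one (conf, x, y, W, H, class) vector per anchor box.
        detection = (s, s, (5 + num_classes) * num_anchors)
        # Tracking branch: one heatmap channel per anchor box per pixel position.
        tracking = (s, s, s * s * num_anchors)
        shapes.append({"grid": s, "detection": detection, "tracking": tracking})
    return shapes


for entry in branch_output_shapes():
    print(entry)
```

The tracking output grows with the fourth power of the grid size (width*height positions, each holding width*height*A channels), so the 52*52 scale dominates the memory of the tracking branch.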

It should be appreciated that, in FIG. 3, DarkNet is taken as an example of the neural network, where Darknetconv2d_BN_Leaky (DBL) represents convolution + Batch Normalization (BN) + Leaky ReLU (leaky rectified linear unit), resn (e.g., res1, res2, . . . , res8) represents the quantity of res_units in each res_block, and concat represents the concatenation of tensors. Upsampling results at a middle layer and a layer after the middle layer in DarkNet are concatenated, and a dimension of a tensor is expanded due to the concatenation. In FIG. 3, conv represents a convolutional layer.

During the training of the model, a training target generation mode of the target detection branch is the same as that in other methods. There exists a correspondence between channels of the target tracking branch and pixels in the target detection branch. When a training target is generated on a pixel (x, y) of a detection branch z1 and an a^(th) type of anchor box in accordance with a true value of a detection box for a certain target in Pi, a heatmap is generated on a channel (a*13*13+y*13+x) of the target tracking branch, where (a*13*13+y*13+x) represents a channel index of the target in a tracking result of the tracking branch. The heatmap is a Gaussian response map with σ as a variance, whose center is obtained by transforming the coordinates of the center of the target in the image Pi−1 corresponding to the target in Pi (e.g., the abscissa of the center is multiplied by s1, and the ordinate is multiplied by s2). A Gaussian peak value is 1, and each pixel spaced apart from the Gaussian center by more than 3σ has a value of 0, as shown in FIG. 4. Through this design, each anchor box in the target detection branch uniquely corresponds to one tracking heatmap of a target within a true value of a detection box matching the anchor box.
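A training heatmap of the kind described above may be generated as in the following sketch. Several points are assumptions rather than statements of the disclosure: the function name gaussian_heatmap is hypothetical, σ is treated as a standard deviation measured in heatmap pixels, and the cut-off distance is read as 3σ.

```python
import math


def gaussian_heatmap(center_xy, heatmap_size, image_size, sigma=1.0):
    """Build one Gaussian response map for a target center in image Pi-1.

    center_xy    -- (abscissa, ordinate) of the target center in image pixels
    heatmap_size -- (heatmap_width, heatmap_height)
    image_size   -- (image_width, image_height)
    """
    W, H = heatmap_size
    image_width, image_height = image_size
    s1 = W / image_width
    s2 = H / image_height
    # Transform the center into heatmap coordinates.
    cx = center_xy[0] * s1
    cy = center_xy[1] * s2
    heatmap = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dist = math.hypot(x - cx, y - cy)
            # Pixels farther than 3*sigma from the center stay at 0 (assumed cut-off).
            if dist <= 3 * sigma:
                heatmap[y][x] = math.exp(-(dist ** 2) / (2 * sigma ** 2))
    return heatmap


# A target centered at (192, 192) in a 416*416 image maps to cell (6, 6).
hm = gaussian_heatmap((192, 192), (13, 13), (416, 416), sigma=1.0)
```

The peak value equals 1 only when the transformed center falls exactly on a pixel; otherwise the largest value in the map is slightly below 1.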

For example, A anchor boxes are generated in z1 for each pixel point, and there are in total 13*13*A anchor boxes corresponding to 13*13*A heatmaps in o1. Hence, a heatmap prediction result corresponding to an anchor box of an a^(th) type in an I^(th) row and a J^(th) column in the target detection branch z1 is a C^(th) channel in o1, where the index C=a*13*13+J*13+I. Each channel corresponds to a tracking heatmap having a width of 13 and a height of 13.

A anchor boxes are generated in z2 for each pixel point, and there are in total 26*26*A anchor boxes corresponding to 26*26*A heatmaps in o2. Hence, a heatmap prediction result corresponding to an anchor box of an a^(th) type in an I^(th) row and a J^(th) column in the target detection branch z2 is a C^(th) channel in o2, where the index C=a*26*26+J*26+I. Each channel corresponds to a tracking heatmap having a width of 26 and a height of 26.

A anchor boxes are generated in z3 for each pixel point, and there are in total 52*52*A anchor boxes corresponding to 52*52*A heatmaps in o3. Hence, a heatmap prediction result corresponding to an anchor box of an a^(th) type in an I^(th) row and a J^(th) column in the target detection branch z3 is a C^(th) channel in o3, where the index C=a*52*52+J*52+I. Each channel corresponds to a tracking heatmap having a width of 52 and a height of 52.
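The correspondence between an anchor box and its heatmap channel at the three scales may be sketched as below. The function names heatmap_channel and anchor_from_channel are hypothetical, and the anchor type, row and column are assumed to be counted from zero so that the indices cover exactly 0 to S*S*A − 1 for a grid of size S.

```python
def heatmap_channel(a, row_i, col_j, grid):
    """Channel index C = a*S*S + J*S + I for grid size S."""
    return a * grid * grid + col_j * grid + row_i


def anchor_from_channel(c, grid):
    """Inverse mapping: recover (a, I, J) from a channel index."""
    a, rest = divmod(c, grid * grid)
    col_j, row_i = divmod(rest, grid)
    return a, row_i, col_j


# Anchor type 2 in row 5, column 7 of z2 (grid size 26).
c = heatmap_channel(2, 5, 7, 26)
print(c, anchor_from_channel(c, 26))
```

Because the mapping is invertible, every anchor box at a given scale owns exactly one channel, which is the uniqueness property relied on during training and prediction.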

During the prediction, the pre-processed images are inputted into the network to obtain a detection prediction result and a tracking prediction result. The detection output result is post-processed, and coordinates and a scale of the detection box in an original image are calculated in accordance with coordinates and a scale of a normalized detection box. In addition, the result is processed through a Non-Maximum Suppression (NMS) algorithm, so as to obtain a selected final detection box output result. A corresponding target tracking heatmap is obtained from the target tracking branch in accordance with an index of the anchor box corresponding to the detection box output result, and the correspondence between the anchor box and the target tracking heatmap is identical to that mentioned hereinabove. A point with a maximum value is selected from the target tracking heatmap, and coordinates (iz, jz) of the point are obtained, so coordinates of a center of a target in the image Pi−1 corresponding to the target in the image Pi are (iz/heatmap_width*image_width, jz/heatmap_height*image_height), where heatmap_width and heatmap_height represent a width and a height of the target tracking heatmap in the target tracking branch respectively. In this way, with respect to all detection boxes acquired through prediction, predicted coordinates of a center of a same target in a previous image are obtained.
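The last step of the prediction, i.e., locating the maximum of the target tracking heatmap and mapping it back to the image Pi−1, may be sketched as follows. The function name previous_center is hypothetical, the heatmap is represented as a list of rows, and iz is taken as the horizontal index because it is scaled by the widths in the expression above.

```python
def previous_center(heatmap, image_size):
    """Return the predicted center on image Pi-1 from one tracking heatmap.

    heatmap    -- list of rows (heatmap_height rows of heatmap_width values)
    image_size -- (image_width, image_height)
    """
    heatmap_height = len(heatmap)
    heatmap_width = len(heatmap[0])
    image_width, image_height = image_size
    iz, jz = 0, 0
    best_value = heatmap[0][0]
    # Plain argmax over all heatmap positions.
    for row in range(heatmap_height):
        for col in range(heatmap_width):
            if heatmap[row][col] > best_value:
                best_value = heatmap[row][col]
                iz, jz = col, row
    return (iz / heatmap_width * image_width,
            jz / heatmap_height * image_height)


# A 13*13 heatmap whose maximum lies in row 4, column 9.
example = [[0.0] * 13 for _ in range(13)]
example[4][9] = 0.9
center = previous_center(example, (416, 416))
```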

Pi−1 and Pi are inputted into the model to obtain a detection result and a tracking result, which correspond to the coordinates of the center of the target in the image Pi and the coordinates of the center of the same target in the image Pi−1 respectively. Pi and Pi+1 are inputted into the model to obtain a detection result and a tracking result, which correspond to the coordinates of the center of the target in the image Pi+1 and the coordinates of the center of the same target in the image Pi respectively. The tracking results obtained in the latter, i.e., the predicted centers on the image Pi of the targets detected in the image Pi+1, are compared with the detection results on the image Pi obtained in the former (e.g., through a Hungarian algorithm), and two targets matching each other are a same target in the two adjacent images. The above steps are repeated continuously, so as to track the target in the video stream.
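The comparison between the tracking results and the detection results may be sketched as below. The disclosure names a Hungarian algorithm as an example; this sketch substitutes an exhaustive search over assignments, which reaches the same minimum total distance but is practical only for a handful of targets. The function name match_targets and the distance threshold max_distance are hypothetical and not part of the disclosure.

```python
import math
from itertools import permutations


def match_targets(predicted_centers, detected_centers, max_distance=30.0):
    """Pair predicted centers on image Pi with detected centers on image Pi.

    Returns a list of (predicted_index, detected_index) pairs.
    """
    n, m = len(predicted_centers), len(detected_centers)
    if n == 0 or m == 0:
        return []
    # Always assign every element of the smaller set to a distinct element
    # of the larger set.
    swapped = n > m
    small, large = ((detected_centers, predicted_centers) if swapped
                    else (predicted_centers, detected_centers))
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(math.dist(small[k], large[perm[k]]) for k in range(len(small)))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = []
    for k, idx in enumerate(best_perm):
        # Pairs that are too far apart are treated as different targets.
        if math.dist(small[k], large[idx]) <= max_distance:
            pairs.append((idx, k) if swapped else (k, idx))
    return pairs


predicted = [(100, 100), (300, 200)]           # from the (Pi, Pi+1) pass
detected = [(302, 198), (99, 101), (10, 10)]   # from the (Pi-1, Pi) pass
print(sorted(match_targets(predicted, detected)))
```

For a larger number of targets, a polynomial-time assignment solver such as the Hungarian algorithm mentioned above would replace the exhaustive loop.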

In the embodiments of the present disclosure, the features are extracted from the image once through the deep learning model, so as to obtain the detection boxes for all targets in the image as well as an association result of a same target in two adjacent images. As compared with a conventional method, the detection result and the tracking result are outputted simultaneously through a one-stage model, so the computational resource overhead is reduced to the greatest extent and a step of extracting an explicit feature is omitted, thereby enabling the model to sense the same target in the two adjacent images through end-to-end training on a large quantity of consecutive images.

As shown in FIG. 5, the present disclosure provides in some embodiments a target tracking device 500, which includes: an input module 501 configured to input an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; a detection and tracking module 502 configured to detect a target in the i^(th) image to obtain a first target detection box, and track the target in the (i−1)^(th) image to obtain a tracking heatmap; and a determination module 503 configured to determine a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.

In a possible embodiment of the present disclosure, the determination module 503 includes: a first determination module configured to determine a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and a second determination module configured to determine coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap. The target tracking result includes the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.
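
By way of a non-limiting illustration, the operation of the second determination module may be sketched as follows. The sketch assumes that the target tracking heatmap has already been selected and is available as a two-dimensional list of scores, and that a point on the heatmap is mapped back to the (i−1)^(th) image through a fixed stride. The stride, the half-cell offset and the names `peak_coordinates`, `to_image_coordinates` and `tracking_result` are assumptions made for the illustration and are not specified in the present disclosure.

```python
def peak_coordinates(heatmap):
    """Return (x, y) of the point with the maximum value in a 2-D heatmap."""
    best_x, best_y, best_value = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_x, best_y, best_value = x, y, value
    return best_x, best_y


def to_image_coordinates(peak, stride):
    """Map a heatmap cell to image pixels (assumed: cell center times stride)."""
    x, y = peak
    return ((x + 0.5) * stride, (y + 0.5) * stride)


def tracking_result(target_heatmap, stride, detection_center):
    """Combine the center on the previous image with the detection center."""
    previous_center = to_image_coordinates(peak_coordinates(target_heatmap), stride)
    return {"previous_center": previous_center, "current_center": detection_center}


# Hypothetical 4x3 target tracking heatmap whose peak lies at x=2, y=1.
target_heatmap = [
    [0.01, 0.02, 0.05, 0.01],
    [0.03, 0.10, 0.90, 0.04],
    [0.02, 0.05, 0.20, 0.01],
]
result = tracking_result(target_heatmap, stride=8, detection_center=(24.0, 14.0))
print(result)
# {'previous_center': (20.0, 12.0), 'current_center': (24.0, 14.0)}
```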

In a possible embodiment of the present disclosure, the target tracking device 500 further includes a third determination module configured to, after the second determination module has determined the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with the maximum value in the target tracking heatmap, in the case that the coordinates of the center match first coordinates, determine that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are a detection box for a same target. The first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.
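
By way of a non-limiting illustration, the check performed by the third determination module may be sketched as follows. The present disclosure does not define when two sets of coordinates match, so the sketch assumes that they match when their Euclidean distance does not exceed a tolerance. The tolerance value, the (x_min, y_min, x_max, y_max) box layout and the function name `same_target` are assumptions made for the illustration.

```python
from math import hypot


def same_target(tracked_center, second_box, tolerance=2.0):
    """Decide whether a tracked center matches the center of a second box.

    tracked_center: (x, y) of the target center on the (i-1)-th image.
    second_box: (x_min, y_min, x_max, y_max) detected on the (i-1)-th image.
    tolerance: assumed maximum distance, in pixels, for a match.
    """
    x_min, y_min, x_max, y_max = second_box
    # The first coordinates are the center of the second target detection box.
    first_x = (x_min + x_max) / 2.0
    first_y = (y_min + y_max) / 2.0
    distance = hypot(tracked_center[0] - first_x, tracked_center[1] - first_y)
    return distance <= tolerance


print(same_target((20.0, 12.0), (10, 4, 31, 20)))   # True, centers 0.5 px apart
print(same_target((20.0, 12.0), (40, 40, 60, 60)))  # False
```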

In a possible embodiment of the present disclosure, the detection and tracking module includes: a first processing module configured to process the i^(th) image and the (i−1)^(th) image through a neural network, so as to output a plurality of first feature maps; a second processing module configured to process the plurality of first feature maps through a feature pyramid network, so as to output a plurality of second feature maps; a detection module configured to detect the target through a target detection branch in accordance with the plurality of second feature maps and the preset anchor box, so as to determine the first target detection box in the i^(th) image; and a tracking module configured to track the target through a target tracking branch in accordance with the plurality of second feature maps, so as to obtain multiple classes of tracking heatmaps.

In a possible embodiment of the present disclosure, each class of heatmap includes W*H*A channels, where A represents the quantity of preset anchor boxes and it is an integer greater than 1, and W*H represents a size of the second feature map corresponding to each class of heatmap. A channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box.
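
By way of a non-limiting illustration, one channel layout that satisfies the above relationship may be sketched as follows. The paragraph above only requires that the channel index be positively related to the index of the anchor box and the coordinates of the center; the row-major formula used here, as well as the names `channel_count` and `channel_index`, are assumptions made for the illustration. The center coordinates are taken as integer cell coordinates on the W*H second feature map.

```python
def channel_count(width, height, num_anchors):
    """Quantity of channels in one class of heatmap: W * H * A."""
    return width * height * num_anchors


def channel_index(anchor_index, center_x, center_y, width, height, num_anchors):
    """Assumed row-major layout: the index grows with anchor index, y and x."""
    if not 0 <= anchor_index < num_anchors:
        raise ValueError("anchor_index out of range")
    if not (0 <= center_x < width and 0 <= center_y < height):
        raise ValueError("center lies outside the feature map")
    return (anchor_index * height + center_y) * width + center_x


# Hypothetical feature map of W=4, H=3 with A=2 preset anchor boxes.
W, H, A = 4, 3, 2
print(channel_count(W, H, A))           # 24
print(channel_index(1, 2, 1, W, H, A))  # 18
print(channel_index(1, 3, 2, W, H, A))  # 23, the last channel
```

Under this assumed layout every combination of anchor box and center position maps to a distinct channel, so the target tracking heatmap of a given first target detection box can be selected by a single lookup.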

The target tracking device in the embodiments of the present disclosure is used to implement the above-mentioned target tracking method with same technical features and same technical effects, which will thus not be particularly defined herein.

The present disclosure further provides in some embodiments an electronic apparatus, a computer-readable storage medium and a computer program product.

The non-transient computer-readable storage medium in the embodiments of the present disclosure is configured to store therein a computer instruction. The computer instruction is executed by a computer so as to implement the above-mentioned target tracking method.

The computer program product in the embodiments of the present disclosure includes a computer program. The computer program is executed by a computer so as to implement the above-mentioned target tracking method.

FIG. 6 is a schematic block diagram of an exemplary electronic apparatus 600 in which embodiments of the present disclosure may be implemented. The electronic apparatus is intended to represent all kinds of digital computers, such as a laptop computer, a desktop computer, a workstation, a personal digital assistant, a server, a blade server, a mainframe or other suitable computers. The electronic apparatus may also represent all kinds of mobile devices, such as a personal digital assistant, a cell phone, a smart phone, a wearable device and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the present disclosure described and/or claimed herein.

As shown in FIG. 6, the electronic apparatus 600 includes a computing unit 601 configured to execute various suitable actions and processings in accordance with computer programs stored in a Read Only Memory (ROM) 602 or computer programs loaded into a Random Access Memory (RAM) 603 from a storage unit 608. Various programs and data desired for the operation of the electronic apparatus 600 may also be stored in the RAM 603. The computing unit 601, the ROM 602 and the RAM 603 may be connected to each other via a bus 604. In addition, an input/output (I/O) interface 605 may also be connected to the bus 604.

Multiple components in the electronic apparatus 600 are connected to the I/O interface 605. The multiple components include: an input unit 606, e.g., a keyboard, a mouse and the like; an output unit 607, e.g., a variety of displays, loudspeakers, and the like; a storage unit 608, e.g., a magnetic disk, an optic disk and the like; and a communication unit 609, e.g., a network card, a modem, a wireless transceiver, and the like. The communication unit 609 allows the electronic apparatus 600 to exchange information/data with other devices through a computer network and/or other telecommunication networks, such as the Internet.

The computing unit 601 may be any of various general purpose and/or special purpose processing components having a processing and computing capability. Some examples of the computing unit 601 include, but are not limited to: a central processing unit (CPU), a graphic processing unit (GPU), various special purpose artificial intelligence (AI) computing chips, various computing units running a machine learning model algorithm, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 carries out the aforementioned methods and processes, e.g., the target tracking method. For example, in some embodiments of the present disclosure, the target tracking method may be implemented as a computer software program tangibly embodied in a machine readable medium such as the storage unit 608. In some embodiments of the present disclosure, all or a part of the computer program may be loaded and/or installed on the electronic apparatus 600 through the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the foregoing target tracking method may be implemented. Optionally, in some other embodiments of the present disclosure, the computing unit 601 may be configured in any other suitable manner (e.g., by means of firmware) to implement the target tracking method.

Various implementations of the aforementioned systems and techniques may be implemented in a digital electronic circuit system, an integrated circuit system, a field-programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on a chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or a combination thereof. The various implementations may include an implementation in form of one or more computer programs. The one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special purpose or general purpose programmable processor, may receive data and instructions from a storage system, at least one input device and at least one output device, and may transmit data and instructions to the storage system, the at least one input device and the at least one output device.

Program codes for implementing the methods of the present disclosure may be written in one programming language or any combination of multiple programming languages. These program codes may be provided to a processor or controller of a general purpose computer, a special purpose computer, or other programmable data processing device, such that the functions/operations specified in the flow diagram and/or block diagram are implemented when the program codes are executed by the processor or controller. The program codes may be run entirely on a machine, run partially on the machine, run as a standalone software package partially on the machine and partially on a remote machine, or run entirely on the remote machine or server.

In the context of the present disclosure, the machine readable medium may be a tangible medium, and may include or store a program used by an instruction execution system, device or apparatus, or a program used in conjunction with the instruction execution system, device or apparatus. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. The machine readable medium includes, but is not limited to: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or apparatus, or any suitable combination thereof. A more specific example of the machine readable storage medium includes: an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM or flash memory), an optic fiber, a portable compact disc read only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.

To facilitate user interaction, the system and technique described herein may be implemented on a computer. The computer is provided with a display device (for example, a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to a user, a keyboard and a pointing device (for example, a mouse or a track ball). The user may provide an input to the computer through the keyboard and the pointing device. Other kinds of devices may be provided for user interaction, for example, a feedback provided to the user may be any manner of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received by any means (including sound input, voice input, or tactile input).

The system and technique described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the system and technique), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), the Internet and a blockchain network.

The computer system can include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system, so as to overcome such defects as large management difficulty and insufficient service extensibility in a conventional physical host and a Virtual Private Server (VPS). The server may also be a server of a distributed system, or a server combined with blockchain.

It should be appreciated that, all forms of processes shown above may be used, and steps thereof may be reordered, added or deleted. For example, as long as expected results of the technical solutions of the present disclosure can be achieved, steps set forth in the present disclosure may be performed in parallel, performed sequentially, or performed in a different order, and there is no limitation in this regard.

The foregoing specific implementations constitute no limitation on the scope of the present disclosure. It is appreciated by those skilled in the art that various modifications, combinations, sub-combinations and replacements may be made according to design requirements and other factors. Any modifications, equivalent replacements and improvements made without deviating from the spirit and principle of the present disclosure shall be deemed as falling within the scope of the present disclosure.

What is claimed is:
 1. A target tracking method, comprising: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap; and determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.
 2. The target tracking method according to claim 1, wherein determining the target tracking result comprises: determining a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and determining coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap, wherein the target tracking result comprises the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.
 3. The target tracking method according to claim 2, wherein subsequent to determining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, the target tracking method further comprises, in the case that the coordinates of the center match first coordinates, determining that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are a detection box for a same target, wherein the first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.
 4. The target tracking method according to claim 2, wherein detecting the target in the i^(th) image to obtain the first target detection box and tracking the target in the (i−1)^(th) image to obtain the tracking heatmap comprises: processing the i^(th) image and the (i−1)^(th) image through a neural network to output a plurality of first feature maps; processing the plurality of first feature maps through a feature pyramid network to output a plurality of second feature maps; detecting the target through a target detection branch in accordance with the plurality of second feature maps and the anchor box to determine the first target detection box in the i^(th) image; and tracking the target through a target tracking branch in accordance with the plurality of second feature maps to obtain multiple classes of tracking heatmaps.
 5. The target tracking method according to claim 4, wherein each class of heatmap comprises W*H*A channels, wherein A represents a quantity of anchor boxes and is an integer greater than 1, and W*H represents a size of the second feature map corresponding to a class of heatmap, wherein a channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box.
 6. An electronic apparatus, comprising: at least one processor; and a memory in communication connection with the at least one processor, the memory configured to store therein instructions executed by the at least one processor, and wherein the at least one processor is configured to execute the instructions to implement a target tracking method comprising: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap; and determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.
 7. The electronic apparatus according to claim 6, wherein determining the target tracking result comprises: determining a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and determining coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap, wherein the target tracking result comprises the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.
 8. The electronic apparatus according to claim 7, wherein subsequent to determining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, the at least one processor is configured to execute the instructions to: in the case that the coordinates of the center match first coordinates, determine that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are a detection box for a same target, wherein the first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.
 9. The electronic apparatus according to claim 7, wherein detecting the target in the i^(th) image to obtain the first target detection box and tracking the target in the (i−1)^(th) image to obtain the tracking heatmap comprises: processing the i^(th) image and the (i−1)^(th) image through a neural network to output a plurality of first feature maps; processing the plurality of first feature maps through a feature pyramid network to output a plurality of second feature maps; detecting the target through a target detection branch in accordance with the plurality of second feature maps and the anchor box to determine the first target detection box in the i^(th) image; and tracking the target through a target tracking branch in accordance with the plurality of second feature maps to obtain multiple classes of tracking heatmaps.
 10. The electronic apparatus according to claim 9, wherein each class of heatmap comprises W*H*A channels, wherein A represents a quantity of anchor boxes and is an integer greater than 1, and W*H represents a size of the second feature map corresponding to a class of heatmap, wherein a channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box.
 11. A non-transient computer-readable storage medium storing therein a computer instruction, wherein the computer instruction is executed by a computer to implement a target tracking method comprising: inputting an i^(th) image and an (i−1)^(th) image in a to-be-detected video stream into a target deep learning model, i being an integer greater than 1; detecting a target in the i^(th) image to obtain a first target detection box, and tracking the target in the (i−1)^(th) image to obtain a tracking heatmap; and determining a target tracking result in accordance with the first target detection box, the tracking heatmap and the (i−1)^(th) image.
 12. The non-transient computer-readable storage medium according to claim 11, wherein determining the target tracking result comprises: determining a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and determining coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap, wherein the target tracking result comprises the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.
 13. The non-transient computer-readable storage medium according to claim 12, wherein subsequent to determining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, the computer instruction is executed by the computer to: in the case that the coordinates of the center match first coordinates, determine that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are a detection box for a same target, wherein the first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.
 14. The non-transient computer-readable storage medium according to claim 12, wherein detecting the target in the i^(th) image to obtain the first target detection box and tracking the target in the (i−1)^(th) image to obtain the tracking heatmap comprises: processing the i^(th) image and the (i−1)^(th) image through a neural network to output a plurality of first feature maps; processing the plurality of first feature maps through a feature pyramid network to output a plurality of second feature maps; detecting the target through a target detection branch in accordance with the plurality of second feature maps and the anchor box to determine the first target detection box in the i^(th) image; and tracking the target through a target tracking branch in accordance with the plurality of second feature maps to obtain multiple classes of tracking heatmaps.
 15. The non-transient computer-readable storage medium according to claim 14, wherein each class of heatmap comprises W*H*A channels, wherein A represents a quantity of anchor boxes and is an integer greater than 1, and W*H represents a size of the second feature map corresponding to a class of heatmap, wherein a channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box.
 16. A computer program product, comprising a computer program, wherein the computer program is executed by a processor to implement the target tracking method according to claim 1.
 17. The computer program product according to claim 16, wherein determining the target tracking result comprises: determining a target tracking heatmap in the tracking heatmap in accordance with an index of an anchor box corresponding to the first target detection box and coordinates of a center of the first target detection box; and determining coordinates of a center of the target in the first target detection box on the (i−1)^(th) image in accordance with coordinates of a point with a maximum value in the target tracking heatmap, wherein the target tracking result comprises the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image and the coordinates of the center of the first target detection box.
 18. The computer program product according to claim 17, wherein subsequent to determining the coordinates of the center of the target in the first target detection box on the (i−1)^(th) image in accordance with the coordinates of the point with a maximum value in the target tracking heatmap, the computer program is executed by the processor to: in the case that the coordinates of the center match first coordinates, determine that the first target detection box in the i^(th) image and a second target detection box in the (i−1)^(th) image are a detection box for a same target, wherein the first coordinates are coordinates of a center of the second target detection box obtained after detecting the target in the (i−1)^(th) image.
 19. The computer program product according to claim 17, wherein detecting the target in the i^(th) image to obtain the first target detection box and tracking the target in the (i−1)^(th) image to obtain the tracking heatmap comprises: processing the i^(th) image and the (i−1)^(th) image through a neural network to output a plurality of first feature maps; processing the plurality of first feature maps through a feature pyramid network to output a plurality of second feature maps; detecting the target through a target detection branch in accordance with the plurality of second feature maps and the anchor box to determine the first target detection box in the i^(th) image; and tracking the target through a target tracking branch in accordance with the plurality of second feature maps to obtain multiple classes of tracking heatmaps.
 20. The computer program product according to claim 19, wherein each class of heatmap comprises W*H*A channels, wherein A represents a quantity of anchor boxes and is an integer greater than 1, and W*H represents a size of the second feature map corresponding to a class of heatmap, wherein a channel index of the target tracking heatmap is positively related to the index of the anchor box and the coordinates of the center of the first target detection box. 